Set the below parameters for the downstream analysis
#Give names to saved object and name of results repository
name_result_folder<-'TS_results_Tanima/'
obj_name<-'timeSeries_obj_tanima.Rdata'
#Path to count data and sample data respectively
my_path_data<-'~/A_Projects/EpiGen/R_Work_Folder/TimeSeriesAnalysis/data/Tanima_RNAseq/Tanima_counts_formatted'
my_path_sample_dta<-'~/A_Projects/EpiGen/R_Work_Folder/TimeSeriesAnalysis/data/Tanima_RNAseq/sample_data_formatted.csv'
#Set-up time series object parameters
diff_exp_type<-'DESeq2' #package used for DE – can also be 'limma'
p_val_filter_type<-'padj' #Either padj or pvalue, used to filter for significance
p_thresh<-0.05 #pvalue or padj value threshold for significance
l2fc_thresh<-1 #log(2)foldChange threshold for significance
name_control<-'BY273.ctrl' #Name of experiment as seen in the sample file
name_experiment<-'BY273.krill' #Name of control as seen in the sample file
graphic_vector<-c("#e31a1c","#1f78b4") #Pre-set colors for the groups
#Declare organism
org_sem_sim='org.Ce.eg.db'
my_org_gpro='celegans' #Set the species for the gprofiler analysis
PART_l2fc<-2 #log(2)foldChange threshold for PART clustering
PART_min_clust<-50 #Minimum cluster size for PART
PART_recursion<-100 #Number of recursions, default is 100, using 10 for example
#Used to highlight specific genes regardless of differential gene expression significance
# genes_of_interest <- c('AICDA','APOBEC3H','APOBEC3F','APOBEC3D','APOBEC3C','APOBEC3G','APOBEC3B','APOBEC3A','SMUG1','UNG','EGFR')
genes_of_interest<-read.table('~/A_Projects/EpiGen/R_Work_Folder/TimeSeriesAnalysis/data/Tanima_RNAseq/tanima_genes_interest.txt')$V2
#Some extra set-up
name_save_obj<-paste0(name_result_folder,obj_name)#The object will be saved in result folder
#Create main directory for results
dir.create(name_result_folder)
## Warning in dir.create(name_result_folder): 'TS_results_Tanima' already exists
my_group_names<-c(name_experiment,name_control)
names(graphic_vector)<-c(name_experiment,name_control)
This code chunk first checks if a timeseries object has been saved, if that is the case, the object is loaded instead of being created again. To reset this analysis, the saved timeseries object should be deleted or moved from the repository.
If no timeseries object was found, the code chunk creates the timeseries object using the parameters defined above. The raw matrix is also created using the second function
if(obj_name %in% list.files(name_result_folder)==F){
TS_object <- new('TimeSeries_Object',sample_data=prep_sample_data(my_path_sample_dta,my_group_names),
group_names=my_group_names,group_colors=graphic_vector,DE_method=diff_exp_type,
DE_p_filter=p_val_filter_type,DE_p_thresh=p_thresh,DE_l2fc_thresh=l2fc_thresh,
PART_l2fc_thresh=PART_l2fc,sem_sim_org=org_sem_sim,Gpro_org=my_org_gpro)
TS_object <- create_raw_count_matrix(TS_object,my_path_data)
}else{
load(name_save_obj)
}
Loads the necessary functions to perform differential gene expression analysis If the tool used is DESeq2(Love, Huber, and Anders 2014), the data needs to be normalized using it’s method. If the tool used is limma(Smyth 2005), the normalized matrix should have been inputed, and thus no normalization is needed.
This code chunk performs both the conditional and temporal differential gene expression and saves the results within the Time Series object.
#Perform normalization if the DESeq2 tool is being used and if normalized matrix doesn't exist
if (TS_object@DE_method=='DESeq2' & 'norm' %in% names(TS_object@count_matrix)!=T){
TS_object <- normalize_timeSeries_with_deseq2(time_object=TS_object)
}
#Perform conditional differential gene expression analysis
TS_object<-conditional_DE_wrapper(TS_object)
#Perform temporal differential gene expression analysis
TS_object<-temporal_DE_wrapper(TS_object)
#save the timeseries object
save(TS_object,file=name_save_obj)
The code chunk below prepares and initiates PART clustering(Nilsen et al. 2013). It first retrieves the number of significant genes for PART clustering in the ‘signi_genes’ variable This chunk can be quite lengthy depending on the number of genes included for clustering as well as the number of recursions set for the clustering
#Extract genes for PART clustering based on defined log(2)foldChange threshold
signi_genes<-select_genes_with_l2fc(TS_object)
#Use all samples, but implement a custom order. In this case it is reversed
samps_2<-TS_object@sample_data$sample[TS_object@sample_data$group==TS_object@group_names[2]]
samps_1<-TS_object@sample_data$sample[TS_object@sample_data$group==TS_object@group_names[1]]
#Create the matrix that will be used for PART clustering
TS_object<-prep_counts_for_PART(object=TS_object,target_genes=signi_genes,
scale=T,target_samples=c(samps_2,samps_1))
#Initiate PART clustering with a pre-defined seed – 123456
TS_object<-compute_PART(TS_object,part_recursion=PART_recursion,part_min_clust=PART_min_clust,
dist_param="euclidean", hclust_param="average",
custom_seed=123456)
#Save the TimeSeries object to the directory to prevent loss of data in the event
#of a downstream error
save(TS_object,file=name_save_obj)
The code chunk below runs a gprofiler analysis(Kolberg et al. 2020; Raudvere et al. 2019). IMPORTANT: This function requires that there be a stable internet connection lack of connection or intermittent drops in the connection will result in an error and the termination (stop) of the code chunk
If an error has occured, this code chunk can be re-run separately from the above chunks by uncommenting (removing the ‘#’ in front) the “load(‘timeseries_obj_res.Rdata’)” line.
This will load the results saved after the PART clustering code chunk
This code chunk will overwrite the saved object if it is completed. The overwritten object will then contain the gprofiler analysis results and can be used to generate plots with the downstream code chunks.
# load('timeseries_obj_res.Rdata')
TS_object<-run_gprofiler_PART_clusters(TS_object) #Run the gprofiler analysis
#Save the results in a Rdata obbject
save(TS_object,file=name_save_obj)
Most plots are created in SVG format for ease of editing with SVG editing software such as InkScape (open source software). Some plots are created in html format as they are interactive plots. These require a web browser to open them
To convert SVG files to PNG/JPG/PDF, this website is available: https://svgtopng.com/
HTML files can be opened and then saved as PNG (or other) using the camera icon in the top right of each respective interactive plot
Name of groups beings compared: BY273.krill vs BY273.ctrl
Number of genes analyzed: 35862
Differential expression parameters used:
Method: DESeq2
P statistic filter used: padj with a 0.05 threshold
Log(2)FoldChange significance threshold used: 1
PART parameters used:
PART log(2)FoldChange significance threshold: 2
number of recursions: 100
minimum cluster size: 50
distance parameter: euclidean
hclust parameter: average
custom seed used: 123456
PART computation time: 3445.919 sec elapsed
PCA of sample data:
DEGs found per analysis:
| experiment.name | number.of.DEGs |
|---|---|
| BY273.krill_vs_BY273.ctrl_TP_1 (conditional) | 681 |
| BY273.krill_vs_BY273.ctrl_TP_3 (conditional) | 1429 |
| BY273.krill_vs_BY273.ctrl_TP_6 (conditional) | 2105 |
| TP_3_vs_TP_1 (temporal) | 2684 |
| TP_6_vs_TP_3 (temporal) | 1566 |
| total unique genes | 5496 |
Each experiment listed in the above table has separate results which can be viewed in the ‘TS_results’ folder. In the main results folder, the differential gene expression results are located in the ‘DE_results_conditional’ and ‘DE_results_temporal’ plot.
Each individual experiment creates the following:
DE_raw_data.csv: A csv file containing all the genes and their associated differential expression results, regardless of significance.
DE_sig_data: A csv file containing only the significant differential expression results.
MA_plot.png: A MA plot, these are often used to evaluate the quality of the normalization.
volcano_plot.png: A volcano plot which splits significantly up-regulated and down-regulated genes as well as non-significant genes.
PCA_plot: A PCA plot using only the samples implicated in the differential gene expression analysis.
The pipeline also creates two large heatmaps which summarize conditional and temporal differential expression results. The heatmaps were created using the ComplexHeatmaps package(Gu, Eils, and Schlesner 2016).
These heatmaps illustrate each experiment as different columns (distinguished by color) while the rows distinguish the groups. Plotted in the heatmaps is the log transformed counts, while the log transformed log(2)fold change is seen in a histogram at the bottom of the heatmap. The number of genes in each experiment (colored columns) is indicated in the legend at the bottom of the heatmap.
In the case of the temporal heatmap, the rows are labelled ‘experiment’ and ‘control.’ The ‘experiment’ represents the time points being compared while the control represents the timepoint which the ‘experiment’ is compared against. For example, TP2_vs_TP1 would have the ‘experiment’ be TP2 and the control be TP1, while TP3_vs_TP2 would have TP3 as the ‘experiment’ and TP2 as the ‘control.’ Note that TP=time point.
The conditional heatmap